\section{Generating overspecified descriptions} 
\label{sec:overspecification}

As it stand, Algorithm~\ref{algo:bisim-l} allows very little overspecification in the REs it
generates.  A relation with a low \puse\ might be enough by 
(by itself or in combination with some of the relations already considered) to 
identify the target. Once this relation is added, we obtain an RE, but a shorter, 
more specific RE might be possible, by eliminating some of the previous refinements. 
Hence, the resulting RE might be overspecified. This is the same kind of overspecification 
that the original incremental algorithm allows.  But it has been argued~\cite{Engelhardt_Bailey_Ferreira_2006,Arts_Maes_Noordman_Jansen_2011} that 
a much higher degree of overspecification is usually found in corpora, and this 
is indeed what can be seen in the GRE3D7 corpus.  As we can see in Table~\ref{corpus-distribution}, 
the target is described 16.43\% of the times as a ``small green ball'' when ``green 
ball'' is already an RE.  Using the \puse\ values learnt from the calculus explained
in the previous section, Algorithm~\ref{algo:bisim-l} cannot simulate this behavior. 

Because the fundamental idea of the algorithm is semantics, handling overspecification in 
a natural way is difficult. If two properties have the same interpretation in a given 
model, then once the first has been considered the second will not refine the classes 
obtained so far, and hence the algorithm won't include it in the generated descriptions. 
On the other hand, if we disregard the condition the informativeness constrain (i.e., 
the fact that the addition of a relation indeed refine the class, eliminating some of 
the elements it contains), then we run the risk of generating descriptions like ``the green 
green ball.''

As a compromise, we consider the following variation of Algorithm~\ref{algo:bisim-l} were 
we disregard the informativeness constrain (i.e., we allow the inclusion of new relations 
in the description, even if they do not refine the associated class) \emph{but only during the 
first loop of the algorithm}.  That is, during the first loop over the elements in the 
input list Rs, we will allow the inclusion of all relations that do not trivialize the 
description (i.e., the associated class is not empty).  Because this is done only during 
the first loop, we know that repeated properties will not appear in the generated REs.  
In the remaining loops, additional properties will be added only if they are informative. 
%The modified algorithm is shown in Figure~\ref{fig:algo3}.

%%\newcommand{\nBlue}{\mathit{blue}\xspace}
%%\newcommand{\nGreen}{\mathit{green}\xspace}
%%\newcommand{\nSmall}{\mathit{small}\xspace}
%%\newcommand{\nBig}{\mathit{big}\xspace}
%%\newcommand{\nBall}{\mathit{ball}\xspace}
%%\newcommand{\nCube}{\mathit{cube}\xspace}
%%\newcommand{\nOntop}{\mathit{ontop}\xspace}
%%\newcommand{\nTop}{\mathit{top}\xspace}
%%\newcommand{\nBelow}{\mathit{below}\xspace}
%%\newcommand{\nRightof}{\mathit{rightof}\xspace}
%%\newcommand{\nLeftof}{\mathit{leftof}\xspace}
%%\newcommand{\nLeft}{\mathit{left}\xspace}

%\begin{figure}[ht]
%\begin{minipage}[b]{0.42\linewidth}
%\centering
%\includegraphics[width=\textwidth]{images/3.jpg}
%\vspace*{-.25cm}
%\vspace*{-.4cm}\caption{Input scene}
%\label{GRE3D7-stimulus}
%\end{minipage}
%\hspace*{-0.2cm}
%\begin{minipage}[b]{0.6\linewidth}
%\centering
%\begin{tikzpicture}
%  [
%    n/.style={circle,fill,draw,inner sep=3pt,node distance=1.4cm},
%    aArrow/.style={->, >=stealth, semithick, shorten <= 2pt, shorten >= 2pt},
%  ]
% \node[n,label=above:$e_1$,label=below:{
%    \relsize{-1}$\begin{array}{c}
%      \nLeft\\[-2pt]
%      \nSmall\\[-2pt] 
%      \nBlue \ \  
%      \nBall\end{array}$}] (a) {};

% \node[n,label=above:$e_2$,label=below:{
%    \relsize{-1}$\begin{array}{c}
%      \nLeft\\[-2pt]
%      \nBig\\[-2pt] 
%      \nBlue \ \  
%      \nCube\end{array}$}, right of=a] (b) {};

% \node[n,label=below:$e_3$,label=above:{
%    \relsize{-1}$\begin{array}{c}
%      \nTop \ \       \nLeft \ \ 
%      \nSmall \ \       \nGreen \ \ 
%      \nBall\end{array}$}, above of=b] (c) {};

% \node[n,label=above:$e_4$,label=below:{
%    \relsize{-1}$\begin{array}{c}
%      \nSmall\\[-2pt] 
%      \nGreen\\[-2pt] 
%      \nCube\end{array}$}, right of=b] (d) {};

% \node[n,label=above:$e_5$,label=below:{
%    \relsize{-1}$\begin{array}{c}
%      \nBig\\[-2pt] 
%      \nBlue\\[-2pt] 
%      \nBall\end{array}$}, right of=d] (e) {};

% \node[n,label=above:$e_6$,label=below:{
%    \relsize{-1}$\begin{array}{c}
%      \nBig\\[-2pt] 
%      \nGreen\\[-2pt] 
%      \nCube\end{array}$}, right of=e] (f) {};

% \node[n,label=below:$e_7$,label=above:{
%    \relsize{-1}$\begin{array}{c}
%      \nTop \ \ 
%      \nSmall\ \ 
%      \nBlue \ \  
%      \nCube\end{array}$}, above of=f] (g) {};

% \draw [aArrow,bend right=90] (b) to node[auto,swap]{\relsize{-1}$\nBelow$} (c);
% \draw [aArrow,bend right=90] (c) to node[auto,swap]{\relsize{-1}$\nOntop$} (b);

% \draw [aArrow,bend right=30] (d) to node[auto,swap]{\relsize{-1}$\nLeftof$} (e);
% \draw [aArrow,bend right=30] (e) to node[auto,swap]{\relsize{-1}$\nRightof$} (d);

% \draw [aArrow,bend right=90] (f) to node[auto,swap]{\relsize{-1}$\nBelow$} (g);
% \draw [aArrow,bend right=90] (g) to node[auto,swap]{\relsize{-1}$\nOntop$} (f);

% \draw[dotted] (-.65,-1.2) rectangle (7.1,2.1);

% \end{tikzpicture}
%\vspace*{-.4cm}\caption{Scene as a relational model}
%\label{GRE3D7-stimulus-graph}
%\end{minipage}
%\end{figure}

On termination, the algorithm computes what are called the $\mathcal{L}$-similarity classes of the input model $\gM$. Intuitively, if two elements in the model belong to the same $\mathcal{L}$-similarity class, then $\mathcal{L}$ is not expressive enough to tell them appart (i.e, no formula in $\mathcal{L}$ can distinguish them). 

The algorithm we discuss uses formulas of the $\el$ description logic language~\cite{baad:desc03} to describe refinement classes\footnote{Notice, though, that the particular formal language used is independent of the main algorithm, and different add$_{\mathcal{L}}$(R,$\varphi$,\RE) functions can be used depending on the language involved.}. 
For a detailed description of $\el$, we refer to~\cite{baad:desc03}.  
The interpretation of the $\el$ formula $\psi \sqcap \exists$R.$\varphi$ is the set of all elements that satisfy $\psi$ and that are related by relation R to some element that satisfy $\varphi$. 
For example, the interpretation of the formula \emph{ball} $\sqcap \exists$\emph{leftof}.\emph{cube} is the set of all balls that are on the left of some cube.  

%%We are now ready to describe Algorithms~\ref{algo:bisim-l} and~\ref{algo:bisim-add-el-over}. 

%\begin{figure}[!t]
%\small
%\centering
%\begin{algorithm}[H]
%\dontprintsemicolon
%\caption{Computing $\mathcal{L}$-similarity classes}\label{algo:bisim-l}
%\KwIn{\footnotesize A model $\gM$ and a list Rs $\in (\REL \times [0,1])^*$
% of relation symbols with their \puse\ values, ordered by \puse}
%\KwOut{\footnotesize A set of formulas \RE such that
%$\{\interp{\varphi} \mid \varphi \in \RE\}$ is the set of
%$\mathcal{L}$-similarity classes of $\gM$}

%$\RE \leftarrow \{\top\}$\tcp*[f]{\footnotesize the most general description $\top$ applies to all elements in the scene}

%\For{\em (R,R.\puse) $\in$ Rs}{
%	R.\randomuse = Random(0,1)\tcp*[f]{\footnotesize R.\randomuse is the probability of using R} \;
%        R.\incuse = (1 $-$ R.\puse) / MaxIterations\tcp*[f]{\footnotesize R.\puse\ are incremented by R.\incuse in each loop}
%}

%\Repeat{\em $\forall$((R,R.\puse) $\in$ Rs).(R.\puse $\ge$ 1)\tcp*[f]{\footnotesize R.\puse\ are incremented until they reach 1}}{
%  \While(\tcp*[f]{\footnotesize while some class has at least two elements}){\em $\exists (\varphi \in$ \RE)$.(\#\interp{\varphi}>1)$}{
%      \RE' $\leftarrow$ \RE \tcp*[f]{\footnotesize make a copy for future comparison} \;
%      \For{\em (R, R.\puse) $\in$ Rs}{
%          \If(\tcp*[f]{\footnotesize R will be used in the expression}){\em R.\randomuse $\le$ R.\puse}{
%              \lFor{\em $\varphi \in$ \RE}{
%                  add$_\mathcal{EL}$(R, $\varphi$, \RE)\tcp*[f]{\footnotesize refine all classes using R}}
%                  }\;
%              \If(\tcp*[f]{\footnotesize the classification has changed}){\em \RE $\not =$ \RE'}{exit\tcp*[f]{\footnotesize exit for-loop to try again highest R.\puse}}
%              }
%     \If(\tcp*[f]{\footnotesize the classification has stabilized}){\em \RE $=$ \RE'}{exit\tcp*[f]{\footnotesize exit while-loop to increase R.\puse}}
%  }
%  \lFor{\em (R,R.\puse) $\in$ Rs}{
%    R.\puse $\leftarrow$ R.\puse $+$ R.\incuse\tcp*[f]{\footnotesize increase R.\puse}
%  }
%}
%\end{algorithm}

%\begin{algorithm}[H]
%\dontprintsemicolon
%\caption{add$_\el$(R, $\varphi$, \RE)} \label{algo:bisim-add-el-over}


%\begin{figure}[t]
%\small
%\centering
%\begin{algorithm}[H]
%\dontprintsemicolon
%\caption{add$_\el$(R, $\varphi$, \RE)} \label{algo:bisim-add-el-over}

%\If(\tcp*[f]{\footnotesize are we in the first loop?}){\em FirstLoop?}{
%    Informative $\leftarrow$ TRUE \tcp*[f]{\footnotesize allow overspecification}}
%\lElse(\tcp*[f]{\footnotesize informative: smaller than the original?}) {Informative $\leftarrow$ $\interp{\psi \sqcap \exists \mbox{\em R}.\varphi} \neq \interp{\psi}$} 
%\For{\em $\psi \in$ \RE with $\#\interp{\psi} > 1$}{
%  \If{\em $\psi \sqcap \exists$R.$\varphi$ is not subsumed in \RE\ {\bf and} \tcp*[f]{\footnotesize non-redundant: can't be obtained from \RE?}\\
%    \em \ \ \ $\interp{\psi \sqcap \exists \mbox{\em R}.\varphi} \neq \emptyset$ {\bf and} \tcp*[f]{\footnotesize non-trivial: has elements?}\\
%     \ \ \  \emph{Informative}}{
%    add $\psi \sqcap \exists \mbox{R}.\varphi$ to $\RE$ \tcp*[f]{\footnotesize add the new class to the classification} \;
%    remove subsumed formulas from $\RE$ \tcp*[f]{\footnotesize remove redundant classes}
%  }
%}
%\end{algorithm}
%\vspace*{-.5cm}\caption{Refinement algorithm with probabilities and overspecification for the \el-language}\label{fig:algo3}

%\end{figure}

Algorithm~\ref{algo:bisim-l} takes as input a model and a list Rs of pairs (R,R.\puse) that links each relation R $\in \REL$, the set of all relation symbols in the model,  to some probability of use R.\puse. 
I.e., if $\REL$ is the set of all relation symbols in the model then Rs $\in (\REL \times [0,1])^*$. We assume Rs to be ordered by R.\puse. 

The set $\RE$ will contain the formal description of the refinement classes and it is initialized by the most general description $\top$.  
For each R, we first compute R.\randomuse, a random number in [0,1].  If R.\randomuse $\le$ R.\puse\ then we will use R to refine the set of classes.  The value of R.\puse\ will be incremented by R.$\incuse$ in each main loop, to ensure that all relations are, at some point, considered by the algorithm.  This ensures that a referring expression will be found if it exists; but gives higher probability to expressions using relations with a high R.\puse. 

While $\RE$ contains descriptions that can be refined (i.e., classes with at least two elements) we will call the refinement function add$_\mathcal{L}$(R,$\varphi$,$\RE$) successively with each relation in Rs. A change in one of the classes, can trigger changes in others. For that reason, if $\RE$ changes, we exit the \textbf{for} loop to start again with the relations of higher R.\puse. If after trying to refine the set with all relations in Rs, the set $\RE$ has not changed, then we have reached a stable state (i.e., the classes described in $\RE$ cannot be further refined with the current R.\puse\ values). We will then increment all the R.\puse\ values and start the procedure again. 

Algorithm~\ref{fig:algo1} almost coincides with the one in~\cite{arec2:2008:Areces}.  The \textbf{for} loop will refine each descriptions in $\RE$ using the relation R and the other descriptions already in $\RE$, under certain conditions. The new description should be \emph{non-redundant} (it cannot be obtained from classes already in $\RE$), \emph{non-trivial} (it is not empty), and \emph{informative} (it does not coincide with the original class).  If these conditions are met, the new description is added to $\RE$, and redundant descriptions created by the new description are eliminated. The \textbf{if} statement at the beginning of Algorithm in Figure~\ref{fig:algo2} disregards the informativity test during the first loop of the algorithm allowing overspecification.    

Suppose fixed an input model $\gM$ and values for Rs, and fix also some target element $t$.  Assume also that $t$ indeed has an $\el$-referring expression.  Upon termination, Algorithm~\ref{fig:algo1} will compute an $\el$ formula $\varphi$ such that $\interp{\varphi} = \{t\}$, but $\varphi$ might be different in each run of the algorithm (even though $\gM$ and Rs are fixed).  If we repeat this experiment a statistically significant number of times, we can define an estimate of the probability distribution of the REs generated by the algorithm for $t$, given $\gM$ and Rs. In Section~\ref{sec:evaluation} we will show that given a corpus of REs for $\gM$, it is possible to define R.\puse\ values so that this probability distribution matches with good accuracy the probability distribution of REs found in the corpus.  

